[Chapter Eight][Previous]
[Next] [Art of
Assembly][Randall Hyde]
Art of Assembly: Chapter Eight
- 8.8.4 - The CLASS Type
- 8.8.5 - The Read-only Operand
- 8.8.6 - The USE16, USE32, and FLAT Options
- 8.8.7 - Typical Segment Definitions
- 8.8.8 - Why You Would Want to Control the Loading Order
- 8.8.9 - Segment Prefixes
- 8.8.10 - Controlling Segments with the ASSUME Directive
- 8.8.11 - Combining Segments: The GROUP Directive
- 8.8.12 - Why Even Bother With Segments?
8.8.4 The CLASS Type
The final operand to the segment directive is usually the class type.
The class type specifies the ordering of segments that do not
have the same segment name. This operand consists of a symbol enclosed by
apostrophes (quotation marks are not allowed here). Generally, you should
use the following names: CODE (for segments containing program code); DATA
(for segments containing variables, constant data, and tables); CONST (for
segments containing constant data and tables); and STACK (for a stack segment).
The following program section illustrates their use:
CSEG segment public 'CODE'
mov ax, bx
ret
CSEG ends
DSEG segment public 'DATA'
Item1 byte 0
Item2 word 0
DSEG ends
CSEG segment public 'CODE'
mov ax, 10
add ax, Item2
ret
CSEG ends
SSEG segment stack 'STACK'
STK word 4000 dup (?)
SSEG ends
C2SEG segment public 'CODE'
ret
C2SEG ends
end
The actual loading procedure is accomplished as follows. The assembler locates
the first segment in the file. Since it's a public combined segment, MASM
concatenates all other CSEG segments to the end of this segment. Finally,
since its class is 'CODE', all other segments with the same class (C2SEG)
are placed immediately after it. After processing these segments, the
process repeats with the next segment in the source file that has not yet
been placed. In the example above, the segments will be loaded in the
following order: CSEG, CSEG (2nd occurrence), C2SEG, DSEG, and then SSEG.
The general rule concerning how your files will be loaded into memory is
the following:
- (1) The assembler combines all public segments that have the same name.
- (2) Once combined, the segments are output to the object code file in
the order of their appearance in the source file. If a segment name appears
twice within a source file (and it's public), then the combined segment
will be output to the object code file at the position denoted by the first
occurrence of the segment within the source file.
- (3) The linker reads the object code file produced by the assembler
and rearranges the segments when creating the executable file. The linker
begins by writing the first segment found in the object code file to the
.EXE file. Then it searches throughout the object code file for every segment
with the same class name. Such segments are sequentially written to the
.EXE file.
- (4) Once all the segments with the same class name as the first segment
are emitted to the .EXE file, the linker scans the object code file for
the next segment which doesn't belong to the same class as the previous
segment(s). It writes this segment to the .EXE file and repeats step (3)
for each segment belonging to this class.
- (5) Finally, the linker repeats step (4) until it has linked all the
segments in the object code file.
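One practical consequence of rule (2) is that a block of empty segment declarations at the top of a source file fixes the order, no matter where the real contents appear later. The following is only a sketch of that idea. The segment names are the ones used above, while Count and Main are hypothetical names chosen for illustration:

```asm
; Declarations at the top of the file establish the order
; CSEG, DSEG, SSEG (rule 2: first occurrence wins).
CSEG    segment para public 'CODE'
CSEG    ends

DSEG    segment para public 'DATA'
DSEG    ends

SSEG    segment para stack 'STACK'
        word    100h dup (?)
SSEG    ends

; The real contents can now appear in any order.
DSEG    segment para public 'DATA'
Count   word    0               ; hypothetical variable
DSEG    ends

CSEG    segment para public 'CODE'
        assume  cs:CSEG, ds:DSEG
Main    proc    far
        mov     ax, DSEG        ; point ds at the data segment
        mov     ds, ax
        inc     Count
        mov     ax, 4C00h       ; DOS terminate, return code 0
        int     21h
Main    endp
CSEG    ends
        end     Main
```

Even though the source file fills in DSEG before CSEG, the rules above place CSEG first in the .EXE file because its first (empty) occurrence comes first.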
8.8.5 The Read-only Operand
If readonly is the first operand of the segment directive, the assembler
will generate an error if it encounters any instruction that attempts to
write to this segment. This is most useful for code segments, though it is
possible to imagine a read-only data segment. This option does not actually
prevent you from writing to this segment at run-time. It is very easy to
trick the assembler and write to this segment anyway. However, by specifying
readonly you can catch some common programming errors you would otherwise
miss. Since you will rarely place writable variables in your code segments,
it's probably a good idea to make your code segments readonly.
Example of READONLY operand:
seg1 segment readonly para public 'DATA'
.
.
.
seg1 ends
8.8.6 The USE16, USE32, and FLAT Options
When working with an 80386 or later processor, MASM generates different
code for 16 versus 32 bit segments. When writing code to execute in real
mode under DOS, you must always use 16 bit segments. Thirty-two bit segments
are only applicable to programs running in protected mode. Unfortunately,
MASM often defaults to 32 bit mode whenever you select an 80386 or later
processor using a directive like .386, .486, or .586 in your program. If
you want to use the 32 bit instructions in a real mode program, you will
have to explicitly tell MASM to use 16 bit segments. The use16, use32, and
flat operands to the segment directive let you specify the segment size.
For most DOS programs, you will always want to use the use16 operand.
This tells MASM that the segment is a 16 bit segment and it assembles the
code accordingly. If you use one of the directives to activate the 80386
or later instruction sets, you should put use16 in all your code segments
or MASM will generate bad code.
Example of use16 operand:
seg1 segment para public use16 'data'
.
.
.
seg1 ends
The use32 and flat operands tell MASM to generate code for a 32 bit segment.
Since this text does not deal with protected mode programming, we will not
consider these options. See the MASM Programmer's Guide for more details.
If you want to force use16 as the default in a program that allows 80386
or later instructions, there is one way to accomplish this. Place the
following directive in your program before any segments:
option segment:use16
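As an illustration of mixing the 80386 instruction set with 16 bit segments, here is a minimal sketch. It assumes MASM 6.x and writes the directive as option segment:use16. The names CSEG, SSEG, and Main are placeholders:

```asm
        .386                        ; enable 80386 instructions
        option  segment:use16       ; but keep 16 bit segments as the default

CSEG    segment para public 'CODE'
        assume  cs:CSEG
Main    proc    far
        mov     eax, 12345678h      ; 32 bit operand in a 16 bit segment:
                                    ;  encoded as 66 B8 78 56 34 12
        mov     ax, 4C00h           ; ordinary 16 bit form: B8 00 4C
        int     21h                 ; DOS terminate
Main    endp
CSEG    ends

SSEG    segment para stack 'STACK'
        word    100h dup (?)
SSEG    ends
        end     Main
```

The 66h byte in front of the first instruction is the operand-size prefix the assembler adds because the segment is 16 bit. In a 32 bit segment the same prefix would instead appear on the 16 bit instruction, which is why the wrong segment size produces code that misbehaves in real mode.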
8.8.7 Typical Segment Definitions
Has the discussion above left you totally confused? Don't worry about
it. Until you're writing extremely large programs, you needn't concern yourself
with all the operands associated with the segment directive.
For most programs, the following three segments should prove sufficient:
DSEG segment para public 'DATA'
; Insert your variable definitions here
DSEG ends
CSEG segment para public use16 'CODE'
; Insert your program instructions here
CSEG ends
SSEG segment para stack 'STACK'
stk word 1000h dup (?)
EndStk equ this word
SSEG ends
end
The SHELL.ASM file automatically declares these three segments for you.
If you always make a copy of the SHELL.ASM file when writing a new assembly
language program, you probably won't need to worry about segment declarations
and segmentation in general.
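To show the three segments working together, here is a small complete program built on the same layout. It is not the SHELL.ASM file, only a sketch. Msg and Main are hypothetical names, and the use16 operand is left out because no 80386 directive is in effect:

```asm
DSEG    segment para public 'DATA'
Msg     byte    "Hello from DSEG", 13, 10, "$"
DSEG    ends

CSEG    segment para public 'CODE'
        assume  cs:CSEG, ds:DSEG, ss:SSEG
Main    proc    far
        mov     ax, DSEG            ; ds holds the PSP at entry, so
        mov     ds, ax              ;  point it at our data segment
        mov     dx, offset Msg      ; ds:dx -> '$'-terminated string
        mov     ah, 9               ; DOS print string
        int     21h
        mov     ax, 4C00h           ; DOS terminate, return code 0
        int     21h
Main    endp
CSEG    ends

SSEG    segment para stack 'STACK'
stk     word    1000h dup (?)
SSEG    ends
        end     Main
```

DOS sets cs:ip and ss:sp from the .EXE header, so the program only has to load ds itself.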
8.8.8 Why You Would Want to Control the Loading Order
Certain DOS calls require that you pass the length of your program as
a parameter. Unfortunately, computing the length of a program containing
several segments is a very difficult process. However, when DOS loads your
program into memory, it will load the entire program into a contiguous block
of RAM. Therefore, to compute the length of your program, you need only
know the starting and ending addresses of your program. By simply taking
the difference of these two values, you can compute the length of your program.
In a program that contains multiple segments, you will need to know which
segment was loaded first and which was loaded last in order to compute the
length of your program. As it turns out, DOS always loads the program segment
prefix, or PSP, into memory just before the first segment of your program.
You must consider the length of the PSP when computing the length of your
program. MS-DOS passes the segment address of the PSP in the ds register.
So subtracting the PSP's segment address from the segment address just
past the last byte of your program will produce the length of your program.
The following code segment computes the length of a program in paragraphs:
CSEG segment public 'CODE'
mov bx, ds ;Get PSP segment address
mov ax, seg LASTSEG ;Get segment address of the end of the program
sub ax, bx ;Compute difference
; AX now contains the length of this program (in paragraphs)
.
.
.
CSEG ends
; Insert ALL your other segments here.
LASTSEG segment para public 'LASTSEG'
LASTSEG ends
end
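One DOS call that takes such a length is the resize memory block function (int 21h with ah=4Ah), which expects the block's segment in es and the new size in paragraphs in bx. The sketch below uses it to release the memory beyond the end of the program. It assumes es still holds the PSP address DOS supplied at entry. Main and ResizeFailed are hypothetical labels:

```asm
CSEG    segment para public 'CODE'
        assume  cs:CSEG
Main    proc    far
        mov     bx, seg LASTSEG     ; paragraph just past the program
        mov     ax, es              ; es = PSP segment at entry
        sub     bx, ax              ; bx = length in paragraphs, PSP included
        mov     ah, 4Ah             ; DOS: resize memory block at es
        int     21h
        jc      ResizeFailed        ; carry set if DOS reports an error
        mov     ax, 4C00h           ; normal exit
        int     21h
ResizeFailed:
        mov     ax, 4C01h           ; exit with return code 1
        int     21h
Main    endp
CSEG    ends

SSEG    segment para stack 'STACK'
        word    100h dup (?)
SSEG    ends

; LASTSEG has a class of its own and appears last in the file,
; so the linker places it after every other segment.
LASTSEG segment para public 'LASTSEG'
LASTSEG ends
        end     Main
```

Because the stack segment lies below LASTSEG, shrinking the block to this length leaves the stack intact.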
8.8.9 Segment Prefixes
When the 80x86 references a memory operand, it usually references a
location within the current data segment. However, you can instruct the
80x86 microprocessor to reference data in one of the other segments using
a segment prefix before an address expression.
A segment prefix is either ds:, cs:, ss:, es:, fs:, or gs:. When used in
front of an address expression, a segment prefix instructs the 80x86 to
fetch its memory operand from the specified segment rather than the default
segment. For example, mov ax, cs:I[bx] loads the accumulator from address
I+bx within the current code segment. If the cs: prefix were absent, this
instruction would normally load the data from the current data segment.
Likewise, mov ds:[bp],ax stores the accumulator into the memory location
pointed at by the bp register in the current data segment (remember,
whenever you use bp as a base register, it points into the stack segment
by default).
Segment prefixes are one-byte opcodes that precede the instruction they
modify. Whenever you use a segment prefix you are therefore increasing the
length (and decreasing the speed) of the instruction. You don't want to use
segment prefixes unless you have a good reason to do so.
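The cost is easy to see in the machine code. The fragment below is a sketch meant to sit inside a 16 bit code segment. The comments give the standard 8086 encodings of each instruction:

```asm
        mov     ax, [bx]        ; 8B 07         ds is the default for bx
        mov     ax, es:[bx]     ; 26 8B 07      es: prefix adds one byte
        mov     ax, cs:[bx]     ; 2E 8B 07      cs: prefix adds one byte
        mov     ax, [bp]        ; 8B 46 00      ss is the default for bp
        mov     ax, ds:[bp]     ; 3E 8B 46 00   ds: prefix adds one byte
```

Each override costs exactly one extra byte (26h for es:, 2Eh for cs:, 36h for ss:, 3Eh for ds:). Writing the default segment explicitly, as in mov ax, ds:[bx], produces no prefix at all.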
8.8.10 Controlling Segments with the ASSUME Directive
The 80x86 generally references data items relative to the ds segment
register (or stack segment). Likewise, all code references (jumps, calls,
etc.) are always relative to the current code segment. There is only one
catch: how does the assembler know which segment is the data segment and
which is the code segment (or other segment)? The segment directive
doesn't tell you what type of segment it happens to be in the program. Remember,
a data segment is a data segment because the ds register points at it.
Since the ds register can be changed at run time (using an instruction like
mov ds,ax), any segment can be a data segment.
This has some interesting consequences for the assembler. When you specify
a segment in your program, not only must you tell the CPU that a segment
is a data segment, but you must also tell the assembler where and when that
segment is a data (or code/stack/extra/F/G) segment. The assume directive
provides this information to the assembler.
The assume directive takes the following form:
assume {CS:seg} {DS:seg} {ES:seg} {FS:seg} {GS:seg} {SS:seg}
The braces surround optional items; you do not type the braces as part of
these operands. Note that there must be at least one operand. Seg is either
the name of a segment (defined with the segment directive) or the reserved
word nothing. Multiple operands in the operand field of the assume
directive must be separated by commas.
Examples of valid assume directives:
assume DS:DSEG
assume CS:CSEG, DS:DSEG, ES:DSEG, SS:SSEG
assume CS:CSEG, DS:NOTHING
The assume directive tells the assembler that you have loaded the specified
segment register(s) with the addresses of the specified segments. Note that
this directive does not modify any of the segment registers; it simply
tells the assembler to assume the segment registers are pointing at certain
segments in the program. Like the processor selection and equate
directives, the assume directive modifies the assembler's behavior from
the point MASM encounters it until another assume directive changes the
stated assumption.
Consider the following program:
DSEG1 segment para public 'DATA'
var1 word ?
DSEG1 ends
DSEG2 segment para public 'DATA'
var2 word ?
DSEG2 ends
CSEG segment para public 'CODE'
assume CS:CSEG, DS:DSEG1, ES:DSEG2
mov ax, seg DSEG1
mov ds, ax
mov ax, seg DSEG2
mov es, ax
mov var1, 0
mov var2, 0
.
.
.
assume DS:DSEG2
mov ax, seg DSEG2
mov ds, ax
mov var2, 0
.
.
.
CSEG ends
end
Whenever the assembler encounters a symbolic name, it checks to see which
segment contains that symbol. In the program above, var1 appears in the
DSEG1 segment and var2 appears in the DSEG2 segment. Remember, the 80x86
microprocessor doesn't know about segments declared within your program;
it can only access data in segments pointed at by the cs, ds, es, ss, fs,
and gs segment registers. The assume statement in this program tells the
assembler the ds register points at DSEG1 for the first part of the program
and at DSEG2 for the second part of the program.
When the assembler encounters an instruction of the form mov var1,0, the
first thing it does is determine var1's segment. It then compares this
segment against the list of assumptions the assembler makes for the segment
registers. If you didn't declare var1 in one of these segments, then the
assembler generates an error claiming that the program cannot access that
variable. If the symbol (var1 in our example) appears in one of the
currently assumed segments, then the assembler checks to see if it is the
data segment. If so, then the instruction is assembled as described in the
appendices. If the symbol appears in a segment other than the one that the
assembler assumes ds points at, then the assembler emits a segment override
prefix byte, specifying the actual segment that contains the data.
In the example program above, MASM would assemble mov var1,0 without a
segment prefix byte. MASM would assemble the first occurrence of the
mov var2,0 instruction with an es: segment prefix byte since the assembler
assumes es, rather than ds, is pointing at segment DSEG2. MASM would
assemble the second occurrence of this instruction without the es: segment
prefix byte since the assembler, at that point in the source file, assumes
that ds points at DSEG2. Keep in mind that it is very easy to confuse the
assembler. Consider the following code:
CSEG segment para public 'CODE'
assume CS:CSEG, DS:DSEG1, ES:DSEG2
mov ax, seg DSEG1
mov ds, ax
.
.
.
jmp SkipFixDS
assume DS:DSEG2
FixDS: mov ax, seg DSEG2
mov ds, ax
SkipFixDS:
.
.
.
CSEG ends
end
Notice that this program jumps around the code that loads the ds register
with the segment value for DSEG2. This means that at label SkipFixDS the
ds register contains a pointer to DSEG1, not DSEG2. However, the assembler
isn't bright enough to realize this problem, so it blindly assumes that ds
points at DSEG2 rather than DSEG1. This is a disaster waiting to happen.
Because the assembler assumes you're accessing variables in DSEG2 while the
ds register actually points at DSEG1, such accesses will reference memory
locations in DSEG1 at the same offset as the variables accessed in DSEG2.
This will scramble the data in DSEG1 (or cause your program to read
incorrect values for the variables assumed to be in segment DSEG2).
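One defensive way to write such code is to tell the assembler, at the label where the two paths meet, that it knows nothing about ds, and then to reload ds before relying on it. The following is only a sketch of that approach. Main and SSEG are hypothetical names, and the conditional jump stands in for whatever decides between the two paths:

```asm
DSEG1   segment para public 'DATA'
var1    word    ?
DSEG1   ends

DSEG2   segment para public 'DATA'
var2    word    ?
DSEG2   ends

CSEG    segment para public 'CODE'
        assume  cs:CSEG, ds:DSEG1, es:nothing
Main    proc    far
        mov     ax, seg DSEG1
        mov     ds, ax
        cmp     var1, 0
        je      SkipFixDS           ; ds = DSEG1 on this path

        assume  ds:DSEG2
FixDS:  mov     ax, seg DSEG2
        mov     ds, ax              ; ds = DSEG2 on this path

SkipFixDS:
        assume  ds:nothing          ; two paths meet here: ds is unknown
        mov     ax, seg DSEG2       ; reload ds so that the next
        mov     ds, ax              ;  assumption is actually true
        assume  ds:DSEG2
        mov     var2, 0             ; safe: ds really points at DSEG2

        mov     ax, 4C00h           ; DOS terminate
        int     21h
Main    endp
CSEG    ends

SSEG    segment para stack 'STACK'
        word    100h dup (?)
SSEG    ends
        end     Main
```

With ds:nothing in effect, an accidental reference to var1 or var2 before the reload draws an error from the assembler instead of silently using the wrong segment.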
For beginning programmers, the best solution to the problem is to avoid
using multiple (data) segments within your programs as much as possible.
Save the multiple segment accesses for the day when you're prepared to deal
with problems like this. As a beginning assembly language programmer, simply
use one code segment, one data segment, and one stack segment and leave
the segment registers pointing at each of these segments while your program
is executing. The assume directive is quite complex and can get you into
a considerable amount of trouble if you misuse it. Better not to bother
with fancy uses of assume until you are quite comfortable with the whole
idea of assembly language programming and segmentation on the 80x86.
The nothing reserved word tells the assembler that you haven't the
slightest idea where a segment register is pointing. It also tells the
assembler that you're not going to access any data relative to that segment
register unless you explicitly provide a segment prefix to an address. A
common programming convention is to place assume directives before all
procedures in a program. Since segment pointers to declared segments
in a program rarely change except at procedure entry and exit, this is the
ideal place to put assume directives:
assume ds:P1Dseg, cs:cseg, es:nothing
Procedure1 proc near
push ds ;Preserve DS
push ax ;Preserve AX
mov ax, P1Dseg ;Get pointer to P1Dseg into the
mov ds, ax ; ds register.
.
.
.
pop ax ;Restore ax's value.
pop ds ;Restore ds' value.
ret
Procedure1 endp
The only problem with this code is that MASM still assumes that ds points
at P1Dseg when it encounters code after Procedure1. The best solution is
to put a second assume directive after the endp directive to tell MASM it
doesn't know anything about the value in the ds register:
.
.
.
ret
Procedure1 endp
assume ds:nothing
Although the next statement in the program will probably be yet another
assume directive giving the assembler some new assumptions about ds (at
the beginning of the procedure that follows the one above), it's still a
good idea to adopt this convention. If you fail to put an assume directive
before the next procedure in your source file, the assume ds:nothing
statement above will keep the assembler from assuming you can access
variables in P1Dseg.
Segment override prefixes always override any assumptions made by the
assembler. mov ax, cs:var1 always loads the ax register with the word at
offset var1 within the current code segment, regardless of where you've
defined var1. The main purpose behind the segment override prefixes is
handling indirect references. If you have an instruction of the form
mov ax,[bx] the assembler assumes that bx points into the data segment.
If you really need to access data in a different segment you can use a
segment override, for example, mov ax, es:[bx].
In general, if you are going to use multiple data segments within your
program, you should use full segment:offset names for your variables,
e.g., mov ax, DSEG1:I and mov bx,DSEG2:J. This does not eliminate the need
to load the segment registers or make proper use of the assume directive,
but it will make your program easier to read and help MASM locate possible
errors in your program.
The assume directive is actually quite useful for other things besides
just setting the default segment. You'll see some more uses for this
directive a little later in this chapter.
8.8.11 Combining Segments: The GROUP Directive
Most segments in a typical assembly language program are less than 64
Kilobytes long. Indeed, most segments are much smaller than 64 Kilobytes
in length. When MS-DOS loads the program's segments into memory, several
of the segments may fall into a single 64K region of memory. In practice,
you could combine these segments into a single segment in memory. This might
possibly improve the efficiency of your code if it saves having to reload
segment registers during program execution.
So why not simply combine such segments in your assembly language code?
Well, as the next section points out, maintaining separate segments can
help you structure your programs better and help make them more modular.
This modularity is very important in your programs as they get more complex.
As usual, improving the structure and modularity of your programs may cause
them to become less efficient. Fortunately, MASM provides a directive,
group, that lets you treat two segments as the same physical segment
without abandoning the structure and modularity of your program.
The group directive lets you create a new segment name that encompasses
the segments it groups together. For example, if you have two segments
named "Module1Data" and "Module2Data" that you wish to combine into a
single physical segment, you could use the group directive as follows:
ModuleData group Module1Data, Module2Data
The only restriction is that the end of the second module's data must be
no more than 64 kilobytes away from the start of the first module in memory.
MASM and the linker will not automatically combine these segments and place
them together in memory. If there are other segments between these two in
memory, then the total of all such segments must be less than 64K in length.
To reduce this problem, you can use the class operand to the segment directive
to tell the linker to combine the two segments in memory by using the same
class name:
ModuleData group Module1Data, Module2Data
Module1Data segment para public 'MODULES'
.
.
.
Module1Data ends
.
.
.
Module2Data segment byte public 'MODULES'
.
.
.
Module2Data ends
With declarations like those above, you can use "ModuleData" anywhere MASM
allows a segment name: as the operand to a mov instruction, as an operand
to the assume directive, etc. The following example demonstrates the usage
of the ModuleData segment name:
assume ds:ModuleData
Module1Proc proc near
push ds ;Preserve ds' value.
push ax ;Preserve ax's value.
mov ax, ModuleData ;Load ds with the segment address
mov ds, ax ; of ModuleData.
.
.
.
pop ax ;Restore ax's and ds' values.
pop ds
ret
Module1Proc endp
assume ds:nothing
Of course, using the group directive in this manner hasn't really improved
the code. Indeed, by using a different name for the data segment, one could
argue that using group in this manner has actually obfuscated the code.
However, suppose you had a code sequence that needed to access variables
in both the Module1Data and Module2Data segments. If these segments were
physically and logically separate you would have to load two segment
registers with the addresses of these two segments in order to access their
data concurrently. This would cost you a segment override prefix on all the
instructions that access one of the segments. If you cannot spare an extra
segment register, the situation will be even worse: you'll have to
constantly load new values into a single segment register as you access
data in the two segments. You can avoid this overhead by combining the two
logical segments into a single physical segment and accessing them through
their group rather than individual segment names.
If you group two or more segments together, all you're really doing is
creating a pseudo-segment that encompasses the segments appearing in the
group directive's operand field. Grouping segments does not prevent you
from accessing the individual segments in the grouping list. The following
code is perfectly legal:
assume ds:Module1Data
mov ax, Module1Data
mov ds, ax
.
< Code that accesses data in Module1Data >
.
assume ds:Module2Data
mov ax, Module2Data
mov ds, ax
.
< Code that accesses data in Module2Data >
.
assume ds:ModuleData
mov ax, ModuleData
mov ds, ax
.
< Code that accesses data in both Module1Data and Module2Data >
.
.
.
When the assembler processes segments, it usually starts the location counter
value for a given segment at zero. Once you group a set of segments, however,
an ambiguity arises; grouping two segments causes MASM and the linker to
concatenate the variables of one or more segments to the end of the first
segment in the group list. They accomplish this by adjusting the offsets
of all symbols in the concatenated segments as though they were all symbols
in the same segment. The ambiguity exists because MASM allows you to reference
a symbol in its segment or in the group segment. The symbol has a different
offset depending on the choice of segment. To resolve the ambiguity, MASM
uses the following algorithm:
- If MASM doesn't know that a segment register is pointing at the symbol's
segment or a group containing that segment, MASM generates an error.
- If an assume directive associates the segment name with a segment
register but does not associate a segment register with the group name,
then MASM uses the offset of the symbol within its segment.
- If an assume directive associates the group name with a segment register
but does not associate a segment register with the symbol's segment name,
MASM uses the offset of the symbol within the group.
- If an assume directive provides segment register association with both
the symbol's segment and its group, MASM will pick the offset that would
not require a segment override prefix. For example, if the assume directive
specifies that ds points at the group name and es points at the segment
name, MASM will use the group offset if the default segment register would
be ds since this would not require MASM to emit a segment override prefix
opcode. If either choice results in the emission of a segment override
prefix, MASM will choose the offset (and segment override prefix)
associated with the symbol's segment.
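The second and third cases can be sketched as follows. This fragment simply restates the rules above in code and has not been checked against any particular MASM version. The names Data1, Data2, A, and B are hypothetical:

```asm
DataSegs group  Data1, Data2

Data1   segment para public 'DATA'
A       word    ?
Data1   ends

Data2   segment para public 'DATA'
B       word    ?                   ; first item in Data2
Data2   ends

CSEG    segment para public 'CODE'
        assume  cs:CSEG

        assume  ds:Data2            ; only B's own segment is associated
        mov     ax, Data2
        mov     ds, ax
        mov     ax, B               ; uses B's offset within Data2

        assume  ds:DataSegs         ; only the group is associated
        mov     ax, DataSegs
        mov     ds, ax
        mov     ax, B               ; uses B's offset within DataSegs
CSEG    ends
        end
```

B is the first item in Data2, so its offset within Data2 is zero. Its offset within DataSegs is larger, because the group begins at the start of Data1. The two mov ax, B instructions therefore carry different offsets even though they name the same variable.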
MASM uses the algorithm above if you specify a variable name without a
segment prefix. If you specify a segment register override prefix, then
MASM may choose an arbitrary offset. Often, this turns out to be the group
offset. So the following instruction sequence, without an assume directive
telling MASM that the BadOffset symbol is in Data2, may produce bad object
code:
DataSegs group Data1, Data2, Data3
.
.
.
Data2 segment
.
.
.
BadOffset word ?
.
.
.
Data2 ends
.
.
.
assume ds:nothing, es:nothing, fs:nothing, gs:nothing
mov ax, Data2 ;Force ds to point at data2 despite
mov ds, ax ; the assume directive above.
mov ax, ds:BadOffset ;May use the offset from DataSegs
; rather than Data2!
If you want to force the correct offset, use the variable name containing
the complete segment:offset address form:
; To force the use of the offset within the DataSegs group use an instruction
; like the following:
mov ax, DataSegs:BadOffset
; To force the use of the offset within Data2, use:
mov ax, Data2:BadOffset
You must use extra care when working with groups within your assembly language
programs. If you force MASM to use an offset within some particular segment
(or group) and the segment register is not pointing at that particular segment
or group, MASM may not generate an error message and the program will not
execute correctly. Reading the offsets MASM prints in the assembly listing
will not help you find this error. MASM always displays the offsets within
the symbol's segment in the assembly listing. The only way to really detect
that MASM and the linker are using bad offsets is to get into a debugger
like CodeView and look at the actual machine code bytes produced by the
linker and loader.
8.8.12 Why Even Bother With Segments?
After reading the previous sections, you're probably wondering what
possible good could come from using segments in your programs. To be perfectly
frank, if you use the SHELL.ASM file as a skeleton for the assembly language
programs you write, you can get by quite easily without ever worrying about
segments, groups, segment override prefixes, and full segment:offset names.
As a beginning assembly language programmer, it's probably a good idea to
ignore much of this discussion on segmentation until you are much more comfortable
with 80x86 assembly language programming. However, there are three reasons
you'll want to learn more about segmentation if you continue writing assembly
language programs for any length of time: the real-mode 64K segment limitation,
program modularity, and interfacing with high level languages.
When operating in real mode, segments can be a maximum of 64 kilobytes long.
If you need to access more than 64K of data or code in your programs, you
will need to use more than one segment. This fact, more than any other reason,
has dragged programmers (kicking and screaming) into the world of segmentation.
Unfortunately, this is as far as many programmers get with segmentation.
They rarely learn more than just enough about segmentation to write a program
that accesses more than 64K of data. As a result, when a segmentation problem
occurs because they don't fully understand the concept, they blame segmentation
for their problems and they avoid using segmentation as much as possible.
This is too bad because segmentation is a powerful memory management tool
that lets you organize your programs into logical entities (segments) that
are, in theory, independent of one another. The field of software engineering
studies how to write correct, large programs. Modularity and independence
are two of the primary tools software engineers use to write large programs
that are correct and easy to maintain. The 80x86 family provides, in hardware,
the tools to implement segmentation. On other processors, segmentation is
enforced strictly by software. As a result, it is easier to work with segments
on the 80x86 processors.
Although this text does not deal with protected mode programming, it is
worth pointing out that when you operate in protected mode on 80286 and
later processors, the 80x86 hardware can actually prevent one module from
accessing another module's data (indeed, the term "protected mode"
means that segments are protected from illegal access). Many debuggers available
for MS-DOS operate in protected mode allowing you to catch array and segment
bounds violations. Soft-ICE and Bounds Checker from NuMega are examples
of such products. Most people who have worked with segmentation in a protected
mode environment (e.g., OS/2 or Windows) appreciate the benefits that segmentation
offers.
Another reason for studying segmentation on the 80x86 is because you might
want to write an assembly language function that a high level language program
can call. Since the HLL compiler makes certain assumptions about the organization
of segments in memory, you will need to know a little bit about segmentation
in order to write such code.
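As a rough illustration, the sketch below shows a function that a C program could call. It assumes the conventions of a 16 bit Microsoft C compiler in the small memory model: a code segment named _TEXT with class 'CODE', near calls, a leading underscore on public names, parameters pushed right to left, and the result returned in ax. Other compilers and memory models make different assumptions, and AddTwo is a hypothetical function:

```asm
; Assumed C prototype:  int AddTwo(int a, int b);
_TEXT   segment word public 'CODE'
        assume  cs:_TEXT
        public  _AddTwo
_AddTwo proc    near
        push    bp
        mov     bp, sp
        mov     ax, [bp+4]      ; a: first parameter, above the saved bp
                                ;  and the near return address
        add     ax, [bp+6]      ; b: second parameter
        pop     bp
        ret                     ; the caller removes the parameters
_AddTwo endp
_TEXT   ends
        end
```

The segment name and class matter here: because they match the ones the compiler is assumed to use, the linker combines this code with the compiler's own code segment, so a near call from the C program can reach it.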
Art of Assembly: Chapter Eight - 26 SEP 1996